Luke Terry

1 My Video

# <video width="320" height="240" controls>
#   <source src="usingvideoinrmd.mp4" type="video/mp4">
# Your browser does not support the video tag.
# </video>

2 Introduction

Baseball is a timeless sport which was popularized in the United States in the mid 1850s (Wikipedia (2021)). Like many other sports in the United States, watching baseball, talking about baseball, and making predictions about baseball is a popular pastime among many people.

Compared to other sports, baseball and statistics have a close relationship. Perhaps it’s something about the leisurely pace of a baseball game, or the easily measurable individual actions in each baseball play. Many people believe that the use of statistics in baseball was introduced by Billy Beane, the General Manager of the Oakland A’s, since this narrative was popularized in Michael Lewis’s 2003 book Moneyball (Lewis (2003)), and 2011 movie of the same name. Statistics in baseball, however, have been a consideration within coaching staff far before Billy Beane. There exist evidence of statisticians assisting baseball managers as early as 1940, and many general managers used statistical analysis to inform their decisions prior to Beane, such as Danny Evans of the Dodgers and Doug Melvin of the Brewers. Beane’s fascination with OBP was not a new development either. Sandy Alderson rebuilt the A’s strategy around OBP two decades before Billy Beane ever managed the A’s (Schwarz (2004)).

There even exists a special term for the art of analyzing baseball statistics: sabermetrics. It was coined by Bill James, and influential statistician in the field, and its name refers to SABR, and acronym for the society of american baseball research (Schwarz (2004))

Bill James in 2010

2.1 Variables

Each experimental unit in this case is one pitcher, or rather how one pitcher performed over the course of the 2019 season.

There are several statistics which are calculated based on each pitchers performance. These include traditional statistics, like Earned Run Average (ERA), Wins, and Losses, but I’ve also added a column for the hits a pitcher gives up per out they earn.

Each player is identified in the table by a playerID which is unique to them (believe it or not, there are several MLB players with the same name as other players throughout history).

The variables which I am interested in are any quantitative statistics of a pitcher, and the amount of hits per out they give up, a quantitative measure. I intend to find out which statistics (if any) predict the rate at which a pitcher gives up hits.

library(Lahman)

ptall <- Lahman::Pitching
pt <- subset(ptall, ptall$yearID == 2019) #subset the data so that only 2019 data is analyzed

pt$HPO <- pt$H/pt$IPouts #add hits per out column

pt <- subset(pt, pt$IPouts >= 5)
pt <- subset(pt, is.finite(pt$HPO))


library(DT)
datatable(subset(pt, select = c("playerID", "teamID", "ERA", "W", "L", "HPO")))
#add statcast data to pt --------------------------------------------------------

stc <- read.csv("data/statcast_pitching_data.csv")
#for some reason different baseball databases use different playerIDS
#There was no direct crosswalk between the Lahman database and Statcast, so I had to make it myself.
#The script which does this is in the idCrosswalkTable.R file.
cw <- readRDS("data/crosswalk.RDS")

#initialize vector for new column
velocity <- c()
spin_rate <- c()
release_extension <- c()
xba <- c()
eff_min_vel <- c()
babip <- c()
takes <- c()
launch_angle <- c()

#add statcast data to Lahman pitcher data frame.
for (i in pt$playerID){
  #iterate over every pitcher, get their stats
  mlbID <- cw[which(cw == i),2]
  playerStats <- stc[which(stc$player_id == mlbID),]
  
  #add stats to column vectors
  velocity <- c(velocity, playerStats$velocity)
  spin_rate <- c(spin_rate, playerStats$spin_rate)
  release_extension <- c(release_extension, playerStats$release_extension)
  xba <- c(xba, playerStats$xba)
  eff_min_vel <- c(eff_min_vel, playerStats$eff_min_vel)
  babip <- c(babip, playerStats$babip)
  takes <- c(takes, playerStats$takes)
  launch_angle <- c(launch_angle, playerStats$launch_angle)

}

#add new columnds to pt
pt <- cbind(pt, velocity)
pt <- cbind(pt, spin_rate)
pt <- cbind(pt, release_extension)
pt <- cbind(pt, xba)
pt <- cbind(pt, eff_min_vel)
pt <- cbind(pt, babip)
pt <- cbind(pt, takes)
pt <- cbind(pt, launch_angle)


ptmin30 <- subset(pt, pt$G >= 30) #remove inexperienced or one-off pitchers for preliminary plots

2.2 Data Collection

Luckily for me, baseball’s extra close relationship with statistical scrutiny means that copious amounts of reliable sabermetric data are available from various sources online, free of charge.

The specific data which I am using is a nearly exhaustive record of traditional baseball statistics compiled by Sean Lahman, called the Lahman Database(“The Lahman Baseball Database” (n.d.)). The data set contains standard batting and pitching statistics, for every player, by year, since 1871, although for the sake of brevity, my analysis will focus on the 2019 regular MLB season. Specifically, I intend to use linear regression to isolate factors which affect the rate at which pitchers give up hits.

The database was compiled from various sources by a team of volunteers, since it dates back to 1871. The season that I will focus on however (2019), was likely scraped from mlb.com since major league baseball now offers statistics in that online format.

The data itself, however, was mostly collected by a combination of volunteers and mlb staff. Certain physical characteristics of pitches are also automatically collected by sensors and cameras.

2.3 Why was it gathered?

At its core, the data here was gathered to help mlb coaches and team administrations make informed decisions. The ubiquity and accuracy of sabermetrics, however, have allowed the media to take a real interest in baseball statistics.

2.4 My Interest in the Data

My personal interest in the data does not, however, lie in making informed decisions or predictions of overall baseball games. Each year, the MLB runs a competition called Beat the Streak. The rules are simple: each day, you are allowed to pick one or two batters who you think will get a hit that particular day. For each of your batters, if they get a hit, your streak goes up by 1. If they have an at bat, but fail to get a hit, your streak resets to 0. If you get a streak of 57 during the regular season, you win 5.6 million dollars (“Official Rules: Beat the Streak” (n.d.)).

Obviously, this is a difficult challenge to achieve. The probability of winning it big follows a binomial distribution, with a probability of getting a hit equal to the probability that the chosen batter gets a hit in a day. The probability that a chosen batter gets a hit in a day ostensibly follows a binomial distribution, where the probabilty of a success in each Bernoulli trial is the players batting average, and the number of trials is the number of at bats the player is likely to get based on their position in the batting lineup. Below is a histogram of the seasonal batting averages for all major league players. To remove potential outliers, only players with more than 30 at bats in a season are included in this histogram.

library(ggplot2)
bt <- Lahman::battingStats() #initialize batting data

#create histogram
bt <- subset(bt, bt$AB > 30)
g = ggplot(bt, aes(x=BA))
g = g + geom_histogram(fill = "blue", binwidth = .1)
g = g + xlab("Batting Averages")
g = g + ylab("Frequency")
g = g + ggtitle("Distribution of Seasonal Major League Batting Averages")
g

As you can see, it will be difficult to predict with certainty that a player gets a hit in a day. The average batting average for this historical dataset is 0.239. You can pick players with high batting averages, of course, but those players may not always get many plate appearances, and their batting averages don’t get that much higher. For instance, the all time high batting average in this data set is still only 0.512. (The batting average is from Rudy Pemberton’s 1996 season with the Red Sox)

If we assume that a player as good as Rudy Pemberton can be chosen everyday (which is a generous assumption), and we assume that this player gets 4 plate appearances per game (which is the average amount that most places in the order will get, especially when players who have a higher batting average are chosen Douglas (2017)), then the probability of a hit is about 94%1.

Although this is a high probability, we have to keep in mind that we don’t just want to predict whether a player will get a hit in a day, we want to predict whether a player will get a hit in a day 57 days in a row. Thus if we were able to pick Rudy Pemberton every day, the probability of attaining a 57 day streak is 0.00203, or about 1 in 500.2

This answer, however, assumes that each hit is a totally independent trial, where the only factor at play is the player’s batting average. In reality, there are many factors which affect a baseball players ability to get hits. Conventional knowledge dictates that the pitcher which a batter is facing is the most important factor that dictates a player’s ability to get a hit (apart from the players themselves). Because of this, this analysis will focus on what factors may cause a pitcher to give up more hits.

3 Premliminary Plots and Interpretation of the data

In order to decide which variables to analyze using linear regression, it’s useful to make preliminary plots and analyze them subjectively for seemingly linear patterns. Below are a few which I’ve constructed using some traditional pitching statistics. All graphs are a plot of a traditional pitching statistic versus a pitcher’s Hits Per Out (HPO).

For the sake of the preliminary analysis, I’ve filtered the pitching data to only include pitchers who’ve pitched a total of 30 games over their whole career. Heuristically, this allows me to minimize the effect on outliers in the data, although this is an oversimplification, and may exclude legitimate data. This is an acceptable sacrifice to make for the preliminary plots, however, since later in this analysis I’ll take a more data-driven approach to eliminating outliers. Right now, I’m just looking for any kind of association.

As you can see, these three charts have vaguely linear patterns within them. Earned Run Average (ERA) refers to the amount of runs that a pitcher has, on average, given up per nine innings pitched. Expected batting average and Batting Average for Balls in Play are relatively new metrics to baseball, introduced through the Statcast system installed in all MLB ballparks in 2015 (“Statcast: Glossary” (n.d.)). Expected batting average refers to the likelihood that a batted ball becomes a hit, including home runs. Batting Average for Balls in Play (BABIP), however, refers to the liklihood that a batted ball in play (excluding home runs) becomes a hit.

One of many Hawk-Eye cameras used to collect Statcast Data

Of these three, I think that ERA is the most logical to analyze, given my goal to predict hits per inning given up. The other two are similar to the expected batting average against a hitter which, like Hits Per Out, is a statistic base on the amount of hits a pitcher gives up.

Although ERA is affected by hits (hits earn runs), the two aren’t directly tied together, since it’s possible to get a hit (or several hits), without giving up any runs. There could theoretically exist pitchers who give up many hits, but give up few bases and thus not many runs. Additionally, there could be pitchers who give up very few hits, but give up large amounts of bases, and especially home runs, and thus have a very high ERA but a low HPO.

4 Theory needed to carry out SLR

It is my belief that a pitcher’s HPO tends to increase as their ERA does. Thus, when choosing between pitchers, a pitcher with a higher ERA will tend to give up more hits. I would like to create a model which relates the rate at which a pitcher gives up hits (HPO) to the pitchers, ERA, so I can use the more ubiquitous ERA to predict which pitchers are likely to give up more hits.

As with many sources of real data, however, an unknown number of factors and randomness affects a pitcher’s ability to throw pitches, even if we assume that the skill of the pitcher does not change (or changes in a predictable career-aging curve ((n.d.))). In other words, in the preliminary plots, there seemed to be a vaguely linear association, but all of the points did not line up in a straight line. Because of this, a deterministic model cannot be used, and instead I will need to construct a probabilistic model.

One type of probabilistic model is simple linear regression (SLR). SLR assumes that there is a linear association between the variables, and uses data driven methods to construct a line which can be used to make predictions about y given x and a few constants. In the context of this analysis, we will assume that a pitcher’s HPO is a function of their ERA, so our \(Y\) (dependent) variable is HPO, and our \(X\) (independent) variable is ERA. When the natural randomness of a pitcher is involved we must also add some amount of error \(\epsilon\). Thus,

\[ Y_i = \beta_0 + \beta_1X_i + \epsilon_i \] where \(i\) refers to a specific value of our independent variable (ERA). The regression line \(Y_i = \beta_0 + \beta_1X_i\) is the expected value of \(Y_i\) given \(X_i\) since the epsilon error term can be assumed to be a normal random variable with \(\mu = 0\) and \(\sigma_\epsilon\) (“Simple Linear Regression” (n.d.)). Although there is an independent and dependendent variable, this model does not imply that X causes Y, or vice versa. This analysis simply looks for an association between the two variables, not a causal relationship. This line of best fit is built on a variety of assumptions about the probability distribution of \(\epsilon\), since it is needed to estimate the parameters \(\beta_0\) and \(\beta_1\). These assumptions are as follows (Mendenhall (2016)):

  • \(\mu_\epsilon = 0\); the mean of \(\epsilon\) is 0
  • \(V(\epsilon|X_i) = \sigma^2_\epsilon\); the variance of \(\epsilon\) does not depend on \(X\). No matter the value of X, the variance of \(\epsilon\) is constant.
  • \(\epsilon \sim N\); \(\epsilon\) is normally distributed.
  • \(\epsilon\) at any two observations are independent. Thus, the error at one observation has no effect on the errors at other observations.

5 Estimating Parameters

In order to estimate the parameters \(/beta_0\) and \(\beta_1\), I will now use the Method of Least Squares. This involves finding the line in which the deviations from the expected value of the model (the regression line) are minimized. If all of the assumptions about \(\epsilon\) which I outlined previously are true, the Method of Least Squares produces a line which is identical to the Method of Maximum Likelihood (Mendenhall (2016)). Since the population least squares line is represented by \(Y_i = \beta_0 + \beta_1X_1 + \epsilon_i\), my predicted line base on the 2019 MLB season (my sample) will be \(\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1}X_i + \epsilon_i\). For each \(X_i\), I am able to calculate a residual \(\hat{\epsilon}\), which is the deviation of the data point from my least squares regression line. \[ \hat\epsilon = (y_i - \hat{y_i}) = (y_i - \hat{\beta_0} + \hat{\beta_1}x_i) \] Therefore, the least squares regression line will be the one which minimizes the sum of the squares for error (SSE). The SSE is given by

\[ \text{SSE} = \sum\limits_{i = 1}^n \hat{\epsilon}^2= \sum\limits_{i = 1}^n (y_i - \hat{\beta_0} + \hat{\beta_1}x_i)^2 \] I am searching for the least-squares estimates of the population parameters \(\beta_0\) and \(\beta_1\) which minimize the SSE. I expect that when I find these values, the residuals will be normally distributed with a mean of 0 and a constant variance.

hpo.lm <- with(pt, lm(HPO ~ ERA))
sm <- summary(hpo.lm)
betas <- sm$coefficients
rsq <- sm$r.squared
rsqa <- sm$adj.r.squared

The values predicted by this linear model are \(\hat{\beta_0} = 0.1953\) and \(\hat{\beta_1} = 0.0282\). This model also has an \(r^2\) value of \(0.5733\). \(r^2\), or the Coefficient of Determination, is a measure of the proportion of the variation in the population is accounted for in your model (Mendenhall (2016)). It will become important further in this analysis, when we analyze whether this linear model is a useful predictor of a pitcher’s HPO. It’s adjusted \(r^2\) value is \(0.5728\), which as you can see is very close to the \(r^2\) value. Adjusted \(r^2\) also takes into account the amount of data points in the set.

6 Check for outliers

g <- ggplot(data = pt, aes(x = ERA, y = HPO))
g = g + geom_point(col = "orange")
g = g + stat_smooth(method = "lm", col = "blue")
g

As you can see, the plot of our data with the regression line looks significantly different than our preliminary plot. This is because I removed all pitchers who had pitched less than 30 full games worth of innings. At this point, however, I would like to take a more mathematical approach to removing outliers, in order to create the most accurate linear model that I can. This dataset lends itself to outliers since baseball position players sometimes pitch an inning or two in certain situations.

To find these outliers, I will use a metric called Cook’s distance. This is a way of quantifying the influence that a data point has. Cook’s distance takes into account the size of each datum’s residual, as well as the leverage that a data point has on the regression line (“Cook’s Distance” (2021)).

cd <- cooks.distance(hpo.lm)
cooksdf <- as.data.frame(cd)
cooksdf$obs <- c(1:length(cd))

cp <- ggplot(cooksdf, aes(x = obs, y = cd))
cp = cp + geom_col(col = "orange")
cp = cp + xlim(0, length(pt$playerID) + 5) #+ ylim(0, 1)
cp = cp + ggtitle("Cooks Distance Plot") + xlab("Observation Number") + ylab("Cook's Distance")
cp

As you can see from the plot, there are a few observations which have an excessively large effect on the regression. I will remove any datum which has a Cook’s Distance larger than \(\frac{1}{873} \approx 0.001\). The reason I chose 0.004, is because heuristically, a cook’s distance over \(\frac{1}{n}\), where \(n\) is the number of observations, is too high (Zach (2020)).

cd <- cooks.distance(hpo.lm)
cooksdf <- as.data.frame(cd)
cooksdf$obs <- c(1:length(cd))

cp <- ggplot(cooksdf, aes(x = obs, y = cd))
cp = cp + geom_col(col = "orange")
cp = cp + xlim(0, length(pt$playerID) + 5) + ylim(0, 0.1)
cp = cp + ggtitle("Cooks Distance Plot") + xlab("Observation Number") + ylab("Cook's Distance")
cp = cp + geom_hline(yintercept = (1/873), col = "blue")
cp

7 Development of model without Outliers

#remove data with an excessively high Cook's Distance
baddata <- subset(cooksdf, cd >= (1/873), select = obs)$obs
ptf <- pt[-c(baddata),]

ptf <- subset(ptf, is.finite(HPO))

hpof.lm <- with(ptf, lm(HPO ~ ERA))
smf <- summary(hpof.lm)
betas <- smf$coefficients
rsq <- smf$r.squared
rsqa <- smf$adj.r.squared

In the above block of code, I have produced another linear model, this time with the potential outliers removed. The values predicted by this linear model are \(\hat{\beta_0} = 0.1844\) and \(\hat{\beta_1} = 0.0302\). This model also has an \(r^2\) value of \(0.6901\). It has an adjusted \(r^2\) value of \(`r round(rsqa, 4)\), however. Adjusted \(r^2\) is a similar metric to \(r^2\), in that it is a measure of how well the model fits the data. Adjusted \(r^2\), however, accounts for the amount of data points in your data set.

g <- ggplot(data = ptf, aes(x = ERA, y = HPO))
g = g + geom_point(col = "blue")
g = g + stat_smooth(method = "lm", col = "orange")
g = g + ggtitle("Scatterplot with Outliers Removed")
g

As you can see, some data points have been removed as outliers. That data point near the edge was not removed despite having high leverage. This is because it is so close to the line that it has a very low residual.

The higher \(r^2\) value of this cleaned data indicates that it’s a better fit for the data.

8 Validity with mathematical expressions

Now that I have an estimate for the values in my linear model, it’s important to analyze whether the assumptions that this was built on are true. This section will focus on the theory behind verifying assumptions; the next section will focus on actually verifying the assumptions in this data set.

If you recall an earlier section, there are four assumptions on which my model is built. These assumptions involve the distribution of the error, specifically that it is normally distributed, with a mean of zero, and a constant variance. Additionally, all errors should be independent of all other errors. To verify these assumptions, I will first use the residuals to find the Residual Sum of Squares (RSS). It is calculated using the following equation, where \(\hat{y}\) is a value predicted by our regression line, and each \(y_i\) is a data point.

\[ \text{RSS} = \sum\limits_{i=1}^n (\hat{y}_i - y_i)^2 \]

After this, I will find the Model Sum of Squares (MSS), which is the sum of the distances between the model’s predicted value and the mean of the dependent variable at each data point, squared.

\[ \text{MSS} = \sum\limits_{i=1}^n (\hat{y}_i - \overline{Y})^2 \]

The third metric I would like to find is the Total Sum of Squares (TSS), which is the sum of the squared differences between each data point and the mean of the data.

\[ \text{TSS} = \sum\limits_{i=1}^n (y_i - \overline{Y})^2 \]

In order to better explain these variables and their relationship. The \(R^2\) value which I mentioned earlier is calculated as \(\frac{\text{MSS}}{\text{TSS}}\).

After this analysis is complete, I intend to plot the residuals to show that there is no obvious pattern in them. Then, I will use the Shapiro-Wilk test for normality to show that the errors are normally distributed.

9 Actual tests for Validity

In this section, I will put the theory introduced in the last section to work on our data set. The scatterplot at the end of the previous section seems to illustrate that a simple linear model is a good fit for the data, but many statistics exist to test whether this model is the best one, and to test whether the assumptions that I made when creating the model and removing its outliers are true.

9.1 Residual Plot

e <- ggplot(ptf, aes(x = ERA, y = HPO))
e = e + geom_point()
e = e + geom_smooth(method = 'lm')
e = e + geom_segment(aes(x=ERA, xend=ERA, y=HPO, yend=hpof.lm$fitted.values), col = "orange")
e

The residuals, which were used to calculate RSS, and therefore TSS can be visulized as the distance between each point and the regression line, drawn on the plot above. It is the distance between \(y\) and \(\hat{y}\) for each data point Our plot was constructed so that the sum of the squares of the distances of these line segments is minimized.

9.2 Means Plot

e <- ggplot(ptf, aes(x = ERA, y = HPO))
e = e + geom_point()
e = e + geom_smooth(method = 'lm')
e = e + geom_segment(aes(x=ERA, xend=ERA, y=hpof.lm$fitted.values, yend=mean(hpof.lm$fitted.values)), col = "orange")
e = e + geom_hline(aes(yintercept = mean(hpof.lm$fitted.values)), col = "blue")
e

The above plot is a visualization of the mean sum of squares (MSS) of the model. Ultimately, the sum of the above plotted line segments is the MSS, which can be used to calculate the \(R^2\) value of the model. It is the distance from \(\hat{y}\) to \(\bar{y}\) for each data point.

9.3 TSS Plot

e <- ggplot(ptf, aes(x = ERA, y = HPO))
e = e + geom_point()
#e = e + geom_smooth(method = 'lm')
e = e + geom_segment(aes(x=ERA, xend=ERA, y=HPO, yend=mean(hpof.lm$fitted.values)), col = "orange")
e = e + geom_hline(aes(yintercept = mean(hpof.lm$fitted.values)), col = "blue")
e

The Last metric, the total sum of squares is the distance between \(y\) and \(\bar{y}\) for every data point. These three metrics can be used to calculate the \(R^2\) value of this model (covered in the previous section), which R auto-magically did behind the scenes.

9.3.1 Error Assumptions

Now I need to verify a 4 assumptions that have to do with my error, all of which are encapsulated in the statement below.

\[\epsilon_i \stackrel{\text{iid}}{\sim} N(0,\sigma^2)\]

In particular, I need to verify that the errors \(\epsilon\) are indeed normally distributed, do actually have a mean of 0 (or reasonably close to 0), and have a constant variance. Additionally, all errors should be independent of all other errors.

9.4 Normality, Independence, and Constant Variance

s20x::normcheck(hpof.lm, shapiro.wilk = TRUE)

In this test, the null hypothesis \(H_0\) is that the data is normally distributed. \(p > \alpha = 0.05\), so we are unable to reject \(H_0\) and conclude that it’s plausible that the errors are normally distributed, with a constant mean and variance.

9.5 Mean of 0

mueps <- mean(hpof.lm$residuals)

The mean of the errors is \(-9.3992746\times 10^{-19}\), which is so close to zero that my assumption is effectively satisfied.

10 Assesing the Use of a Linear Model

10.1 Loess Smooth Plot for the data

f <- ggplot(data = ptf, aes(x = ERA, y = HPO))
f = f + geom_point()
f = f + geom_smooth(method = "loess", col = "blue", fill = "orange")
f = f + ggtitle("Smooth Trend Line for Pitcher's HPO vs ERA with Outliers Removed")
f

As you can see, with our adjusted data set, a smooth trendline can be drawn. The line on this plot is not a regression line. It’s simply a smooth-ish line drawn using an algorithm called the LOESS Algorithm. The orange band shown on this plot is the area where a regression line could be reasonably drawn. It’s most visible on this plot towards the right side, where there are fewer data points from which a residual can be drawn. As you can see, the trend line seems like it should be one straight line, showing that my choice of model was appropriate. It also has a reasonably small margin of error.

10.2 Residual Plot

h <- ggplot(data = ptf, aes(x = ERA, y = hpof.lm$residuals))
h = h + geom_point(col = "blue")
h = h + geom_hline(yintercept = 0, col = "orange")
h = h + ylab("Residuals")
h

As you can see from the plot above, there is no clear pattern in the residuals. The mean of the residuals seems to be roughly 0 (where I’ve inserted a horizontal line), and it’s pretty much symmetrical on the y axis. This shows that there is no significant deviation from the pattern accounted for by the model.

10.3 Residuals plotted versus fitted values

j <- ggplot(data = ptf, aes(x = hpof.lm$fitted.values, y = hpof.lm$residuals))
j = j + geom_point(col = "orange")
j = j + geom_smooth(method = "loess")
j = j + xlab("Fitted Values") + ylab("Residuals")
j

The uniformity that we see around 0 shows that this linear model is appropriate for the dataset. The large curve away from 0 could be construed as evidence of needing a different model, but in the context of our analysis, I don’t think this is the case. The reason that the model is pulled in this direction is because of that single data point, a pitcher named Ryan Dull, who had an abysmal ERA of 19.29 during the 7 outs which he earned with the New York Yankees (“Ryan Dull Stats” (n.d.)). He wasn’t removed as an outlier because the cook’s distance of his data point was small enough. This means that his season did not have a big enough effect on the overall trend line to be considered an outlier. Forcing our model to more closely represent his season, however, would likely cause overfitting of the data and worsen our model’s predictive power, since he only pitched for a very small amount of time.

11 Analysis of Linear Model

11.1 Point Estimates and Model Summary

summary(hpof.lm)
## 
## Call:
## lm(formula = HPO ~ ERA)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.108280 -0.028982  0.000372  0.028104  0.111980 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.184399   0.003908   47.19   <2e-16 ***
## ERA         0.030246   0.000745   40.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04098 on 740 degrees of freedom
## Multiple R-squared:  0.6901, Adjusted R-squared:  0.6897 
## F-statistic:  1648 on 1 and 740 DF,  p-value: < 2.2e-16

Above is a printed summary of the linear model, from which we can quantify our regression line and make predictions. Our regression line is \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x = 0.1844 + 0.0302x\)

The point estimates for \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are given by this summary, too. They correspond to \(\hat{\beta_0}\) and \(\hat{\beta_1}\) in the above regression line. They represent the estimates of these two values, but in the next section, we will use a confidence interval to make a probabilistic statement about these values.

11.2 Confidence Interval

s20x::ciReg(hpof.lm, conf.level=0.95, print.out=TRUE)
##             95 % C.I.lower    95 % C.I.upper
## (Intercept)        0.17673           0.19207
## ERA                0.02878           0.03171

The above function constructed a 95% confidence interval for the values \(\hat{\beta_0}\) and \(\hat{\beta_1}\). This means that if we used the same process to construct a regression line for data sampled from the same population (a different baseball season, perhaps) over and over, an effectively infinite number of times, 95% of the confidence intervals constructed would contain the true population values for \(\hat{\beta_0}\) and \(\hat{\beta_1}\), and 5% of them would not. This can be more concisely communicated by saying that we are “95% confident” that the population parameter falls in this interval, though this can cause confusion since the confidence interval does not imply a probability.

Regardless, we are 95% confident that the true population value for \(\hat{\beta_0}\) falls between 0.17673 and 0.19207, and that the true population value for \(\hat{\beta_1}\) falls between 0.02878 and 0.03171.

12 Predictive Use of the Model

Now that the preceding analysis has been completed, it’s time to apply the model on a practical level. If I choose a pitcher, I should be able to use this regression line to predict the rate at which they will give up hits, and therefore compare them to other pitchers and increase my liklihood of keeping my streak.

Since it’s currently the baseball off season, I will look retroactively at a call that had to be made without the help of this model. Take for instance August 1st, 2021. On this day, the Milwaukee Brewers were scheduled to play the Atlanta Braves, who were starting the pitcher Charlie Morton, who at the time had an ERA of 3.69. The lead off hitter for the Brewers, Tyrone Taylor, would likely face Morton a few times, so knowing the HPO of the pitcher is very useful.

predict(hpof.lm, data.frame(ERA=c(3.69)))
##         1 
## 0.2960087

The model predicts that Charlie Morton will give up hits at a rate of 0.2960087 hits per out, or about 3 outs pitched before he gives up a hit.

On the same day, Jean Segura was gearing up to be the lead off batter for the phillies, against Pirate’s Starting Pitcher Mitch Keller, who at the time had an ERA of 7.05.

predict(hpof.lm, data.frame(ERA=c(7.05)))
##         1 
## 0.3976367

Our model predicts that Keller will give up hits at a rate of 0.3976367 hits per out, or around 2.5 outs pitched before he gives up a hit.

Over the course of a game, this difference in HPO shows in the other teams hits. Jean Segura had a batting average of 0.266 during the 2020 season, and Tyrone Taylor had a batting average of 0.237, eerily close to Segura’s. However, If I had run this model for the pitchers that both players would certainly face at least once (likely multiple times), it would have favored choosing Segura over Taylor. In reality, Jean Segura got two hits on August 1st, while Tyrone Taylor got no hits.

Jean Segura, getting a hit in August of 2021

There are, of course, examples where my model makes a prediction that ends up breaking my streak rather than maintaining it, but on a higher level, the information from this model can be used in a meaningful way to help choose batters and (hopefully) maintain my streak.

Data about Jean Segura, Tyrone Taylor, Charlie Morton, and Mitch Keller for this section was manually collected from both mlb.com and baseball-almanac.com. (“The Official Site of Major League Baseball” (n.d.)) (Baseball Almanac (n.d.))

13 Conclusion

13.1 Research Question and Implications

I set out to find a relationship between a pitcher’s ERA, and the rate at which the same pitcher gives up hits. Based on the SLR which I have performed, a linear model is appropriate and allows for meaningful predictions to be made. Although there are many other factors which may also be factored into a batter’s ability to get a hit, the information from this model will actually help me with maintaining my streak, since it allows for an analysis of a pitcher’s average propensity to give up hits given their ERA.

13.2 Suggest ways to improve model or experiment

At the end of the day, the practical use for this entire analysis is to be able to predict the rate of hits that a pitcher gives up, given their ERA, which is a standard and ubiquitous statistic. For my purposes (coming closer to beating the streak), further analysis will still have to be done on different sabermetric data to predict whether a batter will face a pitcher, and what other factors may be affecting the pitcher or the batter.

When I constructed the preliminary plots section, I was surprised to find that there didn’t seem to be any meaningful relationship between a pitcher’s spin rate and the amount of hits they give up. I was surprised by this, since throughout the past season, there has been lots of news about how the MLB had been cracking down on adhesive substances which professional pitchers sometimes use to alter the spin rate of their pitches and gain an edge in their game (Baccellieri (2021)). It was pointed out to me recently, however, that the average spin rate doesn’t give you meaningful information since many pitchers rely on an arsenal of different kinds of pitches, all of which are meant to fly differently. In the future, I would like to perform an analysis of how pitcher’s spin rate affects their HPO, but I want to stratify the data based on pitch type. This analysis, combined with a simple observation of what proportion of each pitch type a pitcher tends to throw, could have significant predictive power which isn’t shown in more common statistics such as ERA.

Also, because of the aforementioned actions of the MLB against adhesive substances, I’d imagine that pitchers may develop different trends, so the predictive power of the 2019 season may be nominal when applied to the 2022 season and beyond.

In summary, although this model is useful, it isn’t the end-all of HPO prediction, and I intend to conduct more thorough analyses on different factors which affect hits, preferably with a future season. It’s likely that the final product that I’m after is a model which takes into account many factors which baseball experts speculate might affect hitting. Since baseball fields have nonstandard dimentions, for instance, some quantification of the ballpark may be necessary (“Statcast Park Factors” (n.d.)). Additionally, baseball is a mental game, and often the skill of a player ebbs and flows throughout the season. Some kind of rolling-average metric may be necessary to effectively model a player’s hitting. Player’s performances can also be modeled to follow “age curves” which can be predicted by comparing players to past players who were statistically similar to them ((n.d.)). Working these factors into the model may also turn out to increase it’s accuracy.

In summary, this analysis is useful, but only to a point. I’ve only scratched the surface of sabermetrics, and if I want to have a reasonable probability of beating the streak, it’s going to take more than just ERA to get the job done. This is the end of this analysis, but it’s just the beginning of a more rigorous model which I intend to construct using MLB data.

References

n.d. https://www.billjamesonline.com/aging_patterns/.
Baccellieri, Emma. 2021. “Five Takeaways One Month into Sticky-Stuff Enforcement.” Sports Illustrated. Sports Illustrated. https://www.si.com/mlb/2021/07/21/sticky-stuff-crackdown-one-month-the-opener.
Baseball Almanac, Inc. n.d. “Baseball Almanac 2021: Baseball History, Baseball Records and Baseball Research.” Baseball Almanac. https://www.baseball-almanac.com/.
“Cook’s Distance.” 2021. Wikipedia. Wikimedia Foundation. https://en.wikipedia.org/wiki/Cook%27s_distance.
Douglas, Joe. 2017. “Ottoneu 101: Plate Appearances by Lineup Spot.” RotoGraphs Fantasy Baseball. https://fantasy.fangraphs.com/buying-generic-plate-appearances-by-lineup-spot/.
Lewis, Michael (Michael M.). 2003. Moneyball : The Art of Winning an Unfair Game / Michael Lewis. 1st ed.. New York: W.W. Norton.
Mendenhall, William M. 2016. Statistics for Engineering and the Sciences / William m. Mendenhall, Terry l. Sincich. Sixth edition..
“Official Rules: Beat the Streak.” n.d. MLB.com. https://www.mlb.com/apps/beat-the-streak/official-rules.
“Ryan Dull Stats.” n.d. Baseball. https://www.baseball-reference.com/players/d/dullry01.shtml.
Schwarz, Alan. 2004. The Numbers Game : Baseball’s Lifelong Fascination with Statistics / Alan Schwarz. 1st ed.. New York: T. Dunne Books.
“Simple Linear Regression.” n.d. https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470721896.ch14.
“Statcast: Glossary.” n.d. MLB.com. https://www.mlb.com/glossary/statcast.
“Statcast Park Factors.” n.d. Baseballsavant.com. https://baseballsavant.mlb.com/leaderboard/statcast-park-factors.
“The Lahman Baseball Database.” n.d. SeanLahman.com. http://www.seanlahman.com/files/database/readme2017.txt.
“The Official Site of Major League Baseball.” n.d. MLB.com. https://www.mlb.com/.
Wikipedia. 2021. BaseballWikipedia, the Free Encyclopedia.” http://en.wikipedia.org/w/index.php?title=Baseball&oldid=1049486124.
Zach. 2020. “How to Identify Influential Data Points Using Cook’s Distance.” Statology. https://www.statology.org/how-to-identify-influential-data-points-using-cooks-distance/.

  1. This is because the probability for getting a hit that day follows a binomial distribution with 4 trials, and a probability of the batting average for each trial (0.512 in Pemberton’s case). The R command used to find this answer is 1-dbinom(0,4,0.512)↩︎

  2. This is because the probability of choosing 57 correct days in a row can be estimated with the geometric distribution, with 1- ~94% (or ~6%) as the probability, and 57 as the number of failures before 1 “success” (or in this case day without a hit) occurs. The R command used to find this answer is dgeom(57,1-pb), where pb is an object containint the exact probability of Pemberton getting a hit in a day as calculated in the previous paragraph.↩︎